test dataset
The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers
Understanding if classifiers generalize to out-of-sample datasets is a central problem in machine learning. Microscopy images provide a standardized way to measure the generalization capacity of image classifiers, as we can image the same classes of objects under increasingly divergent, but controlled factors of variation. We created a public dataset of 132,209 images of mouse cells, COOS-7 (Cells Out Of Sample 7-Class). COOS-7 provides a classification setting where four test datasets have increasing degrees of covariate shift: some images are random subsets of the training data, while others are from experiments reproduced months later and imaged by different instruments. We benchmarked a range of classification models using different representations, including transferred neural network features, end-to-end classification with a supervised deep CNN, and features from a self-supervised CNN. While most classifiers perform well on test datasets similar to the training dataset, all classifiers failed to generalize their performance to datasets with greater covariate shifts. These baselines highlight the challenges of covariate shifts in image data, and establish metrics for improving the generalization capacity of image classifiers.
Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models
However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factual incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence.
Prediction Intervals for Individual Treatment Effects in a Multiple Decision Point Framework using Conformal Inference
Accurately quantifying uncertainty of individual treatment effects (ITEs) across multiple decision points is crucial for personalized decision-making in fields such as healthcare, finance, education, and online marketplaces. Previous work has focused on predicting non-causal longitudinal estimands or constructing prediction bands for ITEs using cross-sectional data based on exchangeability assumptions. We propose a novel method for constructing prediction intervals using conformal inference techniques for time-varying ITEs with weaker assumptions than prior literature. We guarantee a lower bound for coverage, which is dependent on the degree of non-exchangeability in the data. Although our method is broadly applicable across decision-making contexts, we support our theoretical claims with simulations emulating micro-randomized trials (MRTs) -- a sequential experimental design for mobile health (mHealth) studies. We demonstrate the practical utility of our method by applying it to a real-world MRT - the Intern Health Study (IHS).
- North America > United States > Michigan (0.04)
- North America > Montserrat (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments
Leiva, Mario, Ngu, Noel, Kricheli, Joshua Shay, Taparia, Aditya, Senanayake, Ransalu, Shakarian, Paulo, Bastian, Nathaniel, Corcoran, John, Simari, Gerardo
The deployment of pre-trained perception models in novel environments often leads to performance degradation due to distributional shifts. Although recent artificial intelligence approaches for metacognition use logical rules to characterize and filter model errors, improving precision often comes at the cost of reduced recall. This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem, building on the idea of abductive learning (ABL) but applying it to test-time instead of training. The input predictions and the learned error detection rules derived from each model are encoded in a logic program. We then seek an abductive explanation--a subset of model predictions--that maximizes prediction coverage while ensuring the rate of logical inconsistencies (derived from domain constraints) remains below a specified threshold. We propose two algorithms for this knowledge representation task: an exact method based on Integer Programming (IP) and an efficient Heuristic Search (HS). Through extensive experiments on a simulated aerial imagery dataset featuring controlled, complex distributional shifts, we demonstrate that our abduction-based framework outperforms individual models and standard ensemble baselines, achieving, for instance, average relative improvements of approximately 13.6\% in F1-score and 16.6\% in accuracy across 15 diverse test datasets when compared to the best individual model. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- South America > Argentina (0.04)
- North America > United States > New York > Onondaga County > Syracuse (0.04)
- (3 more...)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)
How to Correctly Report LLM-as-a-Judge Evaluations
Lee, Chungpa, Zeng, Thomas, Jeong, Jongwon, Sohn, Jy-yong, Lee, Kangwook
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have estimates of these values and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both test and calibration dataset, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
Differentiable Physics-Neural Models enable Learning of Non-Markovian Closures for Accelerated Coarse-Grained Physics Simulations
Xue, Tingkai, Ooi, Chin Chun, Ge, Zhengwei, Leong, Fong Yew, Li, Hongying, Kang, Chang Wei
Numerical simulations provide key insights into many physical, real-world problems. However, while these simulations are solved on a full 3D domain, most analysis only require a reduced set of metrics (e.g. plane-level concentrations). This work presents a hybrid physics-neural model that predicts scalar transport in a complex domain orders of magnitude faster than the 3D simulation (from hours to less than 1 min). This end-to-end differentiable framework jointly learns the physical model parameterization (i.e. orthotropic diffusivity) and a non-Markovian neural closure model to capture unresolved, 'coarse-grained' effects, thereby enabling stable, long time horizon rollouts. This proposed model is data-efficient (learning with 26 training data), and can be flexibly extended to an out-of-distribution scenario (with a moving source), achieving a Spearman correlation coefficient of 0.96 at the final simulation time. Overall results show that this differentiable physics-neural framework enables fast, accurate, and generalizable coarse-grained surrogates for physical phenomena.
- North America > United States (0.14)
- Europe > Romania > Black Sea (0.04)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- Asia > Singapore > Central Region > Singapore (0.04)
XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface
Jenke, Alexander C., Just, Gregor, de Boer, Claas, Wagner, Martin, Bodenstedt, Sebastian, Speidel, Stefanie
Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.
- Health & Medicine (1.00)
- Media > Television (0.35)
- Media > Photography (0.35)
- Media > Film (0.35)
- Information Technology > Human Computer Interaction > Interfaces (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
- North America > United States > Virginia (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- North America > Canada > Ontario > Toronto (0.16)
- North America > United States > Maryland > Montgomery County > Bethesda (0.04)
Modeling X-ray photon pile-up with a normalizing flow
König, Ole, Huppenkothen, Daniela, Finkbeiner, Douglas, Kirsch, Christian, Wilms, Jörn, Yang, Justina R., Steiner, James F., Martínez-Galarza, Juan Rafael
The dynamic range of imaging detectors flown on-board X-ray observatories often only covers a limited flux range of extrasolar X-ray sources. The analysis of bright X-ray sources is complicated by so-called pile-up, which results from high incident photon flux. This nonlinear effect distorts the measured spectrum, resulting in biases in the inferred physical parameters, and can even lead to a complete signal loss in extreme cases. Piled-up data are commonly discarded due to resulting intractability of the likelihood. As a result, a large number of archival observations remain underexplored. We present a machine learning solution to this problem, using a simulation-based inference framework that allows us to estimate posterior distributions of physical source parameters from piled-up eROSITA data. We show that a normalizing flow produces better-constrained posterior densities than traditional mitigation techniques, as more data can be leveraged. We consider model- and calibration-dependent uncertainties and the applicability of such an algorithm to real data in the eROSITA archive.
- Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.14)
- North America > Canada (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > France (0.04)